[57] ABSTRACT

As an optimization problem, clustering data (unsupervised learning) is known to be a difficult problem. Most practical approaches use a heuristic, typically gradient-descent, algorithm to search for a solution in the huge space of possible solutions. Such methods are by definition sensitive to starting points. It is well known that clustering algorithms are extremely sensitive to initial conditions. Most methods for guessing an initial solution simply make random guesses. In this paper we present a method that takes an initial condition and efficiently produces a refined starting condition. The method is applicable to a wide class of clustering algorithms for discrete and continuous data. In this paper we demonstrate how this method is applied to the popular K-means clustering algorithm and show that refined initial starting points indeed lead to improved solutions. The technique can be used as an initializer for other clustering solutions. The method is based on an efficient technique for estimating the modes of a distribution and runs in time guaranteed to be less than overall clustering time for large data sets. The method is also scalable and hence can be efficiently used on huge databases to refine starting points for scalable clustering algorithms in data mining applications.

26 Claims, 5 Drawing Sheets
METHOD FOR REFINING THE INITIAL CONDITIONS FOR CLUSTERING WITH APPLICATIONS TO SMALL AND LARGE DATABASE CLUSTERING

FIELD OF THE INVENTION

The present invention concerns clustering of data and more particularly concerns a method and apparatus for optimizing a starting condition for the clustering of data.

BACKGROUND ART

Data clustering is important in a variety of fields including data mining, statistical data analysis, data compression, and vector quantization. Clustering has been formulated in various ways in the machine learning, pattern recognition, optimization, and statistics literature. The general agreement is that the problem is about finding groups (clusters) in data that consist of data items which are similar to each other. The most general definition of the clustering problem is to view it as a density estimation problem. The value of the hidden cluster variable (the cluster ID) specifies the model from which the data items that belong to that cluster are drawn. Hence the data is assumed to arrive from a mixture model and the mixing labels (cluster identifiers) are hidden.

In general, a mixture model M having K clusters C_i, i = 1, ..., K, assigns a probability to a data point x as follows:

$$\Pr(x \mid M) = \sum_{i=1}^{K} W_i \cdot \Pr(x \mid C_i, M)$$

where W_i are called the mixture weights. The problem of clustering is identifying the properties of the clusters C_i. Usually it is assumed that the number of clusters K is known and the problem is to find the best parameterization of each cluster model. A popular technique for estimating the parameters is the EM algorithm.

There are various approaches to performing the optimization problem of finding a good set of parameters. The most effective class of methods is known as the iterative refinement approach. The basic algorithm goes as follows:

1. Initialize the model parameters, producing a current model.

2. Decide memberships of the data items to clusters, assuming that the current model is correct.

3. Re-estimate the parameters of the current model assuming that the data memberships obtained in 2 are correct, producing a new model.

4. If the current model and new model are sufficiently close to each other, terminate, else go to 2.

As an example, a so-called K-means clustering evaluation starts with a random choice of cluster centroids or means for the clusters. In a one-dimensional problem this is a single number for the average of the data points in a given cluster, but in an n-dimensional problem the mean is a vector of n components. Data items are gathered and are assigned to a cluster based on the distance to the cluster centroid. Once the data points have been assigned, the centroids are recalculated and the data points are again reassigned. Since the centroid location (in n dimensions) will change when the centroids are recalculated (recall they were randomly assigned in the first iteration and it is unlikely they are correct), some data points will switch centroids. The centroids are then again calculated. This process terminates when the assignments and hence centroids cease to change. The output from the K-means clustering process is K centroids and the number of data points that fall within a given centroid.

The present invention is concerned with step 1: the initialization step of choosing the starting centroids.

The most widely used clustering procedures in the pattern recognition and statistics literature are members of the above family of iterative refinement approaches: the K-means algorithm, and the EM algorithm. The difference between EM and K-means is essentially in the membership decision in step 2. In K-means, a data item is assumed to belong to a single cluster, while in the EM procedure each data item is assumed to belong to every cluster but with a different probability. This of course affects the update step (3) of the algorithm. In K-means each cluster is updated based strictly on its membership. In EM each cluster is updated by the entire data set according to fractional memberships determined by the relative probability of membership.

Note that given the initial conditions of step 1 the algorithm is deterministic and the solution is determined by the choice of an initial or starting point. In both K-means and EM, there is a guarantee that the procedure will converge after a finite number of iterations. Convergence is to a local minimum of the objective function (likelihood of the data given the model) and the particular local minimum is determined by the initial starting point (step 1). It is well known that clustering algorithms are extremely sensitive to initial conditions. Most methods for guessing an initial solution simply pick a random guess. Other methods place the initial means or centroids on uniform intervals in each dimension. Some methods take the mean of the global data set and perturb it K times to get the K initial means, or simply pick K random points from the data set. In most situations, initialization is done by randomly picking a set of starting points from the range of the data.

In the above clustering framework, a solution of the clustering problem is a parameterization of each cluster model. This parameterization can be performed by determining the modes (maxima) of the joint probability density of the data and placing a cluster at each mode. Hence one approach to clustering is to estimate the density and then proceed to find the "bumps" in the estimated density. Density estimation using some technique like kernel density estimation is difficult, especially in high-dimensional spaces. Bump hunting is also difficult to perform.

The K-means clustering process is a standard technique for clustering and is used in a wide array of applications in pattern recognition, signal processing, and even as a way to initialize the more expensive EM clustering algorithm. The K-means procedure uses three inputs: the number of clusters K, a set of K initial starting points, and the data set to be clustered. Each cluster is represented by its mean (centroid). Each data item is assigned membership in the cluster having the nearest mean to it (step 2). Distance is measured by the Euclidean distance (or L2 norm).

For a data point (d-dimensional vector) x and mean μ, the distance is given by:

$$\mathrm{Dist}(x, \mu) = \sqrt{\sum_{i=1}^{d} (x_i - \mu_i)^2}$$

A cluster model is updated by computing the mean of its members (step 3).

To specify the algorithm in the framework introduced so far we need to describe the model used. The model at each
cluster is assumed to be a Gaussian. For each cluster, the Gaussian is centered at the mean of the cluster. It is assumed to have a diagonal covariance with constant entries on the diagonals of all the clusters. Note that a harsh cluster membership decision for a data point leads to assigning a point to the nearest cluster (since in this case the Euclidean distance is proportional to the probability assigned to the point by the cluster). Finally, the mixture weights (W_i) in K-means are all assumed equal.

Note that by this definition, K-means is only defined over numeric (continuous-valued) data since the ability to compute the mean is a requirement. A discrete version of K-means exists and is sometimes referred to as harsh EM.

SUMMARY OF THE INVENTION

The present invention concerns a method and apparatus that adjusts the initial starting point to a point that is likely to be nearer to modes of the data distribution. Of course, this is what clustering itself is about, so the challenge is to perform this starting point determination without requiring as much work as clustering.

An exemplary embodiment of the invention includes the steps of iteratively obtaining a multiple number of data portions from a database stored on a storage medium and performing clustering on each of these relatively small data portions to provide a multiple number of clustering solutions. These clustering solutions are considered as a set of candidate starting points from which a "best starting point" is to be derived for the full clustering analysis that is then performed.

The process of choosing which of the clustering solutions to use as a starting point involves an additional analysis of the multiple solutions. In an exemplary embodiment of the invention the multiple clustering solutions from the initial data samples are each chosen as a starting point for a clustering of all the candidate solutions. A "best" solution is derived from this second set of clusterings. The best solution is returned as the refined (improved) starting point to be used in clustering the full data set.

These and other objects, advantages and features of the invention will be described in conjunction with an exemplary embodiment of the invention which is described in conjunction with the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic depiction of a computer system used in implementing the present invention;

FIG. 2 is a flow chart depicting a process of choosing a clustering starting point;

FIG. 3 is a schematic of clustering components used for evaluating data from a database;

FIGS. 4A and 4B are two-dimensional depictions showing a full sample of data and a subsample of that same data to illustrate the intuition behind our method for homing in efficiently on the likely modes of the data distribution;

FIGS. 5A and 5B are two-dimensional depictions of two clustering results on two data samples from the same database to illustrate the sensitivity of the K-means clustering algorithm to choice of random sample given the same starting condition;

FIG. 6 is a two-dimensional depiction of multiple solutions to a clustering of small subsets of data from a database to illustrate why a second clustering of the cluster solutions is needed;

FIG. 7 is a schematic depiction of the refinement method of this invention in accordance with an exemplary embodiment of the invention for clustering components used in evaluating data from a database; and

FIG. 8, which comprises FIGS. 8A and 8B, depicts two clustering solutions, one of which is the result of randomly assigning a starting point and a second of which uses a refined starting point as depicted in FIG. 7. Note how the refined starting point leads to the optimal clustering solution in this case, while the random starting point leads to a bad solution.

DETAILED DESCRIPTION OF EXEMPLARY EMBODIMENT OF THE INVENTION

With reference to FIG. 1 an exemplary data processing system for practicing the disclosed data clustering analysis includes a general purpose computing device in the form of a conventional computer 20, including one or more processing units 21, a system memory 22, and a system bus 23 that couples various system components including the system memory to the processing unit 21. The system bus 23 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures.

The system memory includes read only memory (ROM) 24 and random access memory (RAM) 25. A basic input/output system 26 (BIOS), containing the basic routines that help to transfer information between elements within the computer 20, such as during start-up, is stored in ROM 24.

The computer 20 further includes a hard disk drive 27 for reading from and writing to a hard disk, not shown, a magnetic disk drive 28 for reading from or writing to a removable magnetic disk 29, and an optical disk drive 30 for reading from or writing to a removable optical disk 31 such as a CD ROM or other optical media. The hard disk drive 27, magnetic disk drive 28, and optical disk drive 30 are connected to the system bus 23 by a hard disk drive interface 32, a magnetic disk drive interface 33, and an optical drive interface 34, respectively. The drives and their associated computer-readable media provide nonvolatile storage of computer readable instructions, data structures, program modules and other data for the computer 20. Although the exemplary environment described herein employs a hard disk, a removable magnetic disk 29 and a removable optical disk 31, it should be appreciated by those skilled in the art that other types of computer readable media which can store data that is accessible by a computer, such as magnetic cassettes, flash memory cards, digital video disks, Bernoulli cartridges, random access memories (RAMs), read only memories (ROM), and the like, may also be used in the exemplary operating environment.

A number of program modules may be stored on the hard disk, magnetic disk 29, optical disk 31, ROM 24 or RAM 25, including an operating system 35, one or more application programs 36, other program modules 37, and program data 38. A user may enter commands and information into the computer 20 through input devices such as a keyboard 40 and pointing device 42. Other input devices (not shown) may include a microphone, joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 21 through a serial port interface 46 that is coupled to the system bus, but may be connected by other interfaces, such as a parallel port, game port or a universal serial bus (USB). A monitor 47 or other type of display device is also connected to the system bus 23 via an interface, such as a video adapter 48. In
addition to the monitor, personal computers typically include other peripheral output devices (not shown), such as speakers and printers.

The computer 20 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 49. The remote computer 49 may be another personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 20, although only a memory storage device 50 has been illustrated in FIG. 1. The logical connections depicted in FIG. 1 include a local area network (LAN) 51 and a wide area network (WAN) 52. Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets and the Internet.

When used in a LAN networking environment, the computer 20 is connected to the local network 51 through a network interface or adapter 53. When used in a WAN networking environment, the computer 20 typically includes a modem 54 or other means for establishing communications over the wide area network 52, such as the Internet. The modem 54, which may be internal or external, is connected to the system bus 23 via the serial port interface 46. In a networked environment, program modules depicted relative to the computer 20, or portions thereof, may be stored in the remote memory storage device. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used.

FIG. 3 presents an overview of the processing components used in clustering data. These processing components perform data clustering on a data set or database 60 of records stored on a storage medium such as the computer system's hard disk drive 27. The data records typically are made up of a number of data fields or attributes. For example, a marketing database might contain records with attributes chosen to illustrate a purchaser's buying power, past tastes and present needs, with one record for each purchaser.

The components that perform the clustering require three inputs: the number of clusters K, a set of K initial starting points 62, and the data set 60 to be clustered. The clustering of data by these components produces a final solution 70 as an output. Each of the K clusters of this final solution 70 is represented by its mean (centroid) where each mean has d components equal to the number of attributes of the data records.

A refinement component 72 produces 'better' starting points 74 based on a limited sampling of data from the data set 60 to be clustered. Since the entire clustering process can be viewed as a "refinement of a starting point", the refinement component 72 must run in significantly less time than the overall clustering that is performed by a clustering component 76 that utilizes the refined starting points 74 to provide the clustering solution 70.

A most favorable refined starting point produced by the refinement component 72 would be to move the set of starting points 62 to a set of refined points 74 that are closer to the modes of the data distribution. This appears to be as difficult a problem as the problem of clustering the data (or density estimation) itself. What is needed is a "cheap" method of making an approximate guess at where the modes are. An exemplary embodiment of the invention makes a rough guess based on a small sample of the data from the database 60. The intuition here is that severely subsampling the data will naturally bias the sample to have representatives only from the modes. This subsampling cannot, however, guard against the possibility of points from cluster tails being sampled. Such data would give a very skewed view of the data.

FIGS. 4A and 4B illustrate the fact that subsampling the data emphasizes the modes. These two figures show data drawn from a source containing two Gaussians (clusters) in 2 dimensions. FIG. 4A shows a given data sample, while FIG. 4B illustrates what a randomly drawn very small subsample looks like. Each of the points in FIG. 4B may be thought of as a "guess" for the possible location of a mode in the underlying distribution. The estimates are fairly varied, but they exhibit "expected" behavior. The subsampling produces a good separation between the two clusters.

The invention performs a clustering on the small sample (FIG. 4B) and then uses the result of such clustering to produce the refined starting point 74. This solution still exhibits noisy estimates due to small samples, especially in skewed distributions in databases having high dimensions.

FIGS. 5A and 5B show the result of clustering two different subsamples drawn from the same distribution, and initialized with the same starting point. Cluster centroids are marked on these Figures. The variance in result illustrated by these depictions is fairly common even in low dimensions using data from well-separated Gaussians. These figures also illustrate the importance of the problem of having a good initial or starting point. FIGS. 5A and 5B each depict a clustering of one of 2 different samples of the same size that were obtained from the same database.

FIG. 5A also illustrates another fairly common phenomenon with K-means clustering: some of the clusters may remain empty or have only single outlier points, causing the other clusters to have worsened means (cluster centers).

Clustering the Solutions

The refinement component 72 of FIG. 7 functions in accordance with the flow chart of FIG. 2 and addresses the problem of noisy estimates. In accordance with the disclosed embodiment of the invention, the refinement component 72 of FIG. 3 obtains 102 multiple subsamples 103, and each such subsample is clustered 104 separately. This is done iteratively, producing a total of J solutions which are viewed as J candidate starting points.

As illustrated in the FIG. 2 flowchart, these multiple solutions are then used to produce an optimum solution. One possible use of the multiple solutions would be to take the superposition of the solutions (e.g. the mean of the centroids obtained for each cluster) to reduce the J candidate starting points for a cluster to one candidate point. However, to perform this superposition or averaging, one needs to first solve a correspondence problem: decide a partition on the J solutions (recall each of the J solutions consists of K points) such that the J points in each block of the K-partition belong together. This is illustrated in FIG. 6.

FIG. 6 shows four solutions obtained for K=3, J=4. Suppose that the "true" solution to the problem is where the X's are shown in FIG. 6. The A's show the 3 cluster centroids obtained from the first sample, B's second, C's third, and D's fourth. The problem is how do you know that D1 is to be grouped with A1 but A2 does not also go with that group?

Practice of the invention utilizes clustering as a way to solve this correspondence problem. Returning to FIG. 2, the starting point 74 is derived from a clustering 110 of the J
solutions 106 to assess the correspondence. This can be
viewed as a special form of voting scheme to combine 
solutions. 

Pseudo-code for the FIG. 2 flowchart is as follows:

Algorithm Refine(SP, Data, K, J)

0. CM = ∅

1. For i = 1, ..., J

   a. Let S_i be a small random subsample of Data

   b. Let CM_i = KMeansMod(SP, S_i, K)

   c. CM = CM ∪ CM_i

2. FMS = ∅

3. For i = 1, ..., J

   a. Let FM_i = KMeans(CM_i, CM, K)

   b. Let FMS = FMS ∪ FM_i

4. Let FM = ArgMin over FM_i of {Distortion(FM_i, CM)}

5. Return (FM)



The above algorithm utilizes three calls: KMeans( ), KMeansMod( ) and Distortion( ). The function KMeans is a call to the K-means clustering procedure described previously, which takes as arguments: a starting point, a data subset, and the number of clusters K.

KMeansMod( ) is a slightly modified version of K-means, and takes exactly the same arguments as KMeans( ). The modification is as follows: if at the end of the clustering process any of the clusters have zero membership, then the corresponding initial guess at this cluster centroid is set to the data point farthest from its assigned cluster center. This procedure decreases the likelihood of having empty clusters after reclustering from the "new" initial point. Resetting the empty centroids to another point may be done in a variety of ways. In accordance with an exemplary embodiment, an empty centroid is mapped to the data point having the farthest distance to its corresponding mean. However, other methods are contemplated, including: picking a random point from the data, picking a new point consisting of the mean of the entire data set, or picking the mean of the entire data set but perturbing that mean by a small random amount corresponding to a variance in the data mean in each dimension.
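The reset step of the exemplary embodiment can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `sq_dist` and `reset_empty_centroids` are hypothetical, and when several clusters are empty the sketch assigns them successive farthest points, a choice the text does not specify.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def reset_empty_centroids(data, centroids):
    """Return a copy of `centroids` in which every empty cluster is moved to
    a data point lying far from its own assigned centroid."""
    # Harsh membership: each point belongs to its nearest centroid.
    assign = [min(range(len(centroids)), key=lambda k: sq_dist(x, centroids[k]))
              for x in data]
    counts = [assign.count(k) for k in range(len(centroids))]
    # Rank points by distance to their assigned centroid, farthest first.
    order = sorted(range(len(data)),
                   key=lambda n: sq_dist(data[n], centroids[assign[n]]),
                   reverse=True)
    fixed = [list(c) for c in centroids]
    empties = [k for k, count in enumerate(counts) if count == 0]
    # Assumption: the m-th empty cluster takes the m-th farthest point.
    for k, n in zip(empties, order):
        fixed[k] = list(data[n])
    return fixed
```

A clustering run would then be restarted from the returned centroids, so that the formerly empty cluster begins at a point that is poorly served by the current solution.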

The best solution is chosen 120 using the distortion function. This function measures the degree of fit of a set of clusters to the given data set. It is simply the sum of the squared distances between each data point and the center of the cluster to which it is assigned. It can be shown formally that K-means, viewed as an optimization technique, lands on local minima of this objective function.
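As a small worked illustration of the distortion measure, the Python sketch below sums the squared distance from each point to its nearest centroid. The function name `distortion` and the argument order (centroids first, data second, mirroring Distortion(FM_i, CM)) are assumptions made for this example.

```python
def distortion(centroids, data):
    """Sum over all points of the squared Euclidean distance to the nearest
    centroid, i.e. to the center of the cluster the point is assigned to."""
    total = 0.0
    for x in data:
        total += min(sum((xi - ci) ** 2 for xi, ci in zip(x, c))
                     for c in centroids)
    return total


points = [(0, 0), (2, 0), (10, 0)]
two_clusters = distortion([(1, 0), (10, 0)], points)  # 1 + 1 + 0
one_cluster = distortion([(4, 0)], points)            # 16 + 4 + 36
```

Here the two-centroid solution fits the three points far better (distortion 2) than the single centroid at their mean (distortion 56).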

Assume that 100 iterations (J=100, FIG. 2) of data sampling are performed at the step 102 of FIG. 2. This produces 100 solutions. Assume also that the original clustering of the samples was performed using K = 5 clusters. Each solution then consists of five d-dimensional vectors (d = number of attributes in a data record, for example). At a step 108 each of the 100 solutions CM_i found at the step 104 is used as a starting point in clustering all the 100 solutions. Recall that to initialize the K-means clustering, we need a starting condition consisting of K initial points. So for example, solution #1 (which consists of K cluster centroids) can be used in a straightforward manner to cluster the "data set" consisting of all 100 solutions (this "data set" has 100 * K points in it since each solution consists of K cluster centroids). The K centroid vectors of solution #1 are used as the starting points to cluster all 100 * K vectors resulting from putting all the solutions together. Next repeat the clustering using the K centroids of solution #2 as starting points to cluster the same 100 * K points, and so forth. Each of these 100 clusterings of the set of solutions results in a set of K centroids. At the step 120 the refinement component 72 chooses the result that has the minimal distortion over the set of 100 * K data points. The FM_i that produces this best result is chosen as the refined starting point for clustering by the clustering component 76 of FIG. 3.
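The two-stage procedure described above can be sketched end to end in Python. This is a simplified illustration, not the patented implementation: all function names and the `sample_size` and `seed` parameters are hypothetical, K is taken implicitly from the number of starting points, and a plain K-means is used in the first stage in place of the modified variant (an empty cluster simply keeps its previous centroid).

```python
import random


def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def nearest(x, centroids):
    return min(range(len(centroids)), key=lambda k: sq_dist(x, centroids[k]))


def kmeans(start, data, max_iter=100):
    """Plain K-means from the given starting centroids; returns centroids."""
    centroids = [list(c) for c in start]
    for _ in range(max_iter):
        groups = [[] for _ in centroids]
        for x in data:
            groups[nearest(x, centroids)].append(x)
        # Simplification: an empty cluster keeps its previous centroid.
        new = [[sum(col) / len(g) for col in zip(*g)] if g else c
               for g, c in zip(groups, centroids)]
        if new == centroids:
            break
        centroids = new
    return centroids


def distortion(centroids, data):
    return sum(sq_dist(x, centroids[nearest(x, centroids)]) for x in data)


def refine(start, data, j, sample_size, seed=0):
    """Return a refined starting point (a list of K centroids)."""
    rng = random.Random(seed)
    # First stage: cluster J small subsamples, all from the same start.
    solutions = [kmeans(start, rng.sample(data, sample_size)) for _ in range(j)]
    # Pool the J * K centroids into the "data set" of solutions (CM).
    cm = [c for sol in solutions for c in sol]
    # Second stage: cluster CM once from each candidate solution CM_i.
    candidates = [kmeans(sol, cm) for sol in solutions]
    # Keep the candidate FM_i with minimal distortion over CM.
    return min(candidates, key=lambda fm: distortion(fm, cm))
```

A caller would pass the returned centroids as the starting point of the full clustering run, for example `kmeans(refine(start, data, j=10, sample_size=50), data)`.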

This refinement process is illustrated in the diagram of 
FIG. 7 where multiple solutions are shown and are used 
during an additional clustering step to produce the best 
starting point. The discussion on clustering the 100 * K data 
set is an example of the second clustering step. 

The invention contemplates alternatives to this exemplary process for choosing the best refined starting point. These alternatives include:

1. Pick the solution from CM_i that had the best distortion on the first set of data, wherein 'best' is taken as the distortion over all the points that are sampled for all J iterations. This will bypass the secondary clustering.

2. Do a correspondence between cluster centroids of the solutions (i.e. map each centroid in each of the 100 solutions to its corresponding centroid in all the others), then pick a final solution by taking either: a. centroids of each cluster (direct average), or b. a statistical voting scheme, such as some modified robust estimation method that is less sensitive to outliers. (Note that in the exemplary embodiment of the invention the second clustering stage deals with this "matching" problem efficiently.)

Scaling to Large Databases 

The invention can be used as a preprocessor to any 
clustering algorithm simply by refining the starting point 
prior to clustering. However, as a database size increases, 
especially its dimensionality, efficient and accurate initial- 
ization becomes critical. A clustering session on a data set 
with many dimensions and tens of thousands or millions of 
records can take hours or days. 

Since the data samples are small, the refinement step runs very fast. For example, if 1% or less of the data is sampled, the invention can run trials over 10 samples in less than 10% of the time the full clustering will take. In fact, if the user is willing to spend as much time refining the means as is needed to run the clustering problem, then many more samples can be processed. In general, the time it takes to cluster a data set is proportional to its size. This complexity is multiplied by a factor that depends on the number of iterations required by the clustering algorithm.

If, for a data set D, a clustering algorithm requires Iter(D) iterations to cluster it, then the time complexity is |D| * Iter(D). A small subsample S ⊆ D, where |S| << |D|, typically requires significantly fewer iterations to cluster. Empirically, it is safe to expect that Iter(S) < Iter(D). Hence, given a specified budget of time that a user allocates to the refinement process, simply determine the number J of subsamples to use in the refinement process. When |D| is very large, and |S| is a small proportion of |D|, refinement time is essentially negligible, even for large J.
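The budget calculation can be made concrete with a short Python sketch. It uses the cost model |D| * Iter(D) stated above and, as a simplifying assumption, ignores the comparatively small cost of the second clustering stage over the J * K centroids. The function name and its parameters are hypothetical, and the iteration counts in the example are illustrative values, not measurements.

```python
def max_subsamples(n_records, sample_size, iter_full, iter_sample, budget_percent):
    """Largest J with J * |S| * Iter(S) <= budget_percent% of |D| * Iter(D).

    Integer arithmetic is used throughout to avoid rounding surprises.
    """
    budget = n_records * iter_full * budget_percent // 100
    return budget // (sample_size * iter_sample)


# 1,000,000 records, 1% samples, a budget of 10% of the full clustering time.
j_equal_iters = max_subsamples(1_000_000, 10_000, 20, 20, 10)
j_fewer_iters = max_subsamples(1_000_000, 10_000, 20, 10, 10)
```

With equal iteration counts the budget covers exactly 10 subsamples, matching the example in the text. If the subsamples converge in half as many iterations, the same budget covers 20.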

Another desirable property of practice of the invention is that it easily scales to very large databases. Since the only requirement is to be able to hold a small sample in memory and solutions computed over previous small samples, the algorithm can run with a small buffer of RAM and can be used
on huge databases. Of course, we assume that it is possible 
to obtain a random sample from a large database. This may 
not be straightforward. Unless one can guarantee that the 
records in a database are not ordered by some property, 
random sampling can be as expensive as scanning the entire 
database. Note that in a database environment, what one 
thinks of as a data table (i.e. the entity familiar to machine 
learning researchers) may not exist as a physical table in a 
database. It can be a result of a query which involves joins, 
groupings, and sorts. In many cases this imposes special 
ordering on the result set, and randomness of the data view 
cannot be assumed. 
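When the result set cannot be assumed to be randomly ordered, one standard way to draw a uniform sample in a single sequential scan with a small, fixed buffer is reservoir sampling. The text does not name this technique, so the Python sketch below is offered only as one possible approach. The function name and the `seed` parameter are hypothetical.

```python
import random


def reservoir_sample(records, sample_size, seed=None):
    """Uniform random sample of `sample_size` records from any iterable,
    using one pass and memory proportional to the sample only."""
    rng = random.Random(seed)
    sample = []
    for n, record in enumerate(records):
        if n < sample_size:
            # Fill the buffer with the first `sample_size` records.
            sample.append(record)
        else:
            # Keep the new record with probability sample_size / (n + 1).
            j = rng.randint(0, n)  # inclusive on both ends
            if j < sample_size:
                sample[j] = record
    return sample
```

This still costs one full scan per sample, consistent with the caveat above, but it never needs the ordering of the result set to be random. Several independent reservoirs could be filled during the same scan to obtain all J subsamples at once.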

Example Data 

To illustrate how K-means is sensitive to initial conditions, consider the data shown in FIGS. 8A and 8B. FIG. 8A shows how K-means behaves from some random starting point. This data set shows three Gaussians in 2-D. Note that the Gaussians in this case happen to be centered along a diagonal. The reason for this choice is that even as the dimensionality of the data goes higher, any 2-D projection of the higher-dimensional data will have this same form. This makes the data set easy for a visualization-based approach. Simply project the data to two dimensions, and the clusters reveal themselves. This is a rare property since if the Gaussians are not aligned with the main diagonal dimension axes, any lower-dimensional projection will result in overlaps and separability in 2-D is lost.

FIG. 8A shows use of random starting points (S) and the corresponding K-means solution centroids K. FIG. 8B shows the same initial random starting points (S) and the clustering centroid R using the disclosed refinement procedure. Note that in this case the refinement got us extremely close to the true solution. Running K-means from the refined point converges on the true solution.

It is important to point out that this example is for
illustrative purposes only. The interesting cases are the
data sets drawn from large databases, which have
much higher dimension (100 dimensions and more).

Other Clustering Algorithms 

The refinement method discussed thus far has been presented in
the context of the K-means algorithm. However, the invention
can be generalized to other algorithms, and even to
discrete data (on which means are not defined). If a given
procedure A is being used to cluster the data, then A is also
used to cluster the subsamples. The procedure A will produce
a model. The model is essentially described by its
parameters. The parameters are in a continuous space. The
stage which clusters the clusters (i.e. step 3 of the above
pseudo-code) remains as is; i.e. we use the K-means
algorithm in this step. The reason for using K-means is that
the goal at this stage is to find the "centroid" of the models,
and in this case the harsh membership assignment of
K-means is desirable. The generalized algorithm is given as
follows:

Algorithm Generalized_Refine(SP, Data, K, J)

0. CM = ∅

1. For i = 1, . . . , J

   a. Let S_i be a small random subsample of Data

   b. Let CM_i = ClusterA(SP, S_i, K)

   c. CM = CM ∪ CM_i

2. FMS = ∅

3. For i = 1, . . . , J

   a. Let FM_i = KMeansPrime(CM_i, CM, K)

   b. Let FMS = FMS ∪ FM_i

4. Let FM = ArgMin over FM_i of {Distortion(FM_i, CM)}

5. Return (FM)

Note that the only differences are that step 1.b calls some
other clustering algorithm (ClusterA), and that step 3.a now
uses KMeansPrime( ), which is defined to cluster model
parameters rather than data vectors. The difference here is
that model parameters may be matrices rather than simple
vectors. This generalization only involves defining the
appropriate distance metric.
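
The generalized procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: each model is assumed to be a list of K flattened parameter vectors, KMeansPrime is assumed to use squared Euclidean distance on those vectors, K is implied by the length of `sp`, and the parameter `sample_size` and all function names are hypothetical. In the usage lines plain K-means stands in for ClusterA.

```python
import random
from typing import Callable, List, Sequence

Point = List[float]
Model = List[Point]  # K parameter vectors, one per cluster


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point: Sequence[float], centers: Model) -> int:
    return min(range(len(centers)), key=lambda k: sq_dist(point, centers[k]))


def kmeans(start: Model, data: List[Point], max_iter: int = 100) -> Model:
    """Lloyd iterations with hard membership; an empty cluster keeps its center."""
    centers = [list(c) for c in start]
    for _ in range(max_iter):
        groups: List[List[Point]] = [[] for _ in centers]
        for p in data:
            groups[nearest(p, centers)].append(p)
        updated = [
            [sum(col) / len(g) for col in zip(*g)] if g else c
            for g, c in zip(groups, centers)
        ]
        if updated == centers:
            break
        centers = updated
    return centers


def distortion(centers: Model, data: List[Point]) -> float:
    return sum(sq_dist(p, centers[nearest(p, centers)]) for p in data)


def generalized_refine(sp: Model, data: List[Point], j: int,
                       cluster_a: Callable[[Model, List[Point]], Model],
                       sample_size: int, rng: random.Random) -> Model:
    # Step 1: cluster J small random subsamples with procedure A.
    models = [cluster_a(sp, rng.sample(data, sample_size)) for _ in range(j)]
    cm = [params for model in models for params in model]  # CM = union of CM_i
    # Step 3: cluster the model parameters, starting from each CM_i in turn.
    fms = [kmeans(cm_i, cm) for cm_i in models]
    # Step 4: keep the solution with the smallest distortion over CM.
    return min(fms, key=lambda fm: distortion(fm, cm))


rng = random.Random(0)
data = [[rng.gauss(c, 0.5), rng.gauss(c, 0.5)]
        for c in (0.0, 5.0, 10.0) for _ in range(200)]
sp = rng.sample(data, 3)                      # random initial starting point
refined = generalized_refine(sp, data, j=10, cluster_a=kmeans,
                             sample_size=60, rng=rng)
final = kmeans(refined, data)                 # cluster the full data set
```

Only `cluster_a` touches the data records; the second stage sees model parameters alone, which is why a different procedure A needs no change beyond the distance used in `sq_dist`.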

For example, assume the user is clustering using the EM
algorithm and that the data is discrete, and hence each cluster
specifies a multinomial distribution over the data. A multinomial
distribution has a simple set of parameters: for every
attribute, a vector of probabilities specifies the probabilities
of each value of the attribute given the cluster. Since these
probabilities are continuous quantities, they have a "centroid"
and K-means can be applied to them.
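
A small sketch of this idea follows. The attribute names, probability values, and helper functions are invented for illustration. Each cluster's per-attribute probability vectors are flattened into one parameter vector, and the "centroid" of several such vectors is their component-wise mean, which remains a valid set of probabilities for each attribute.

```python
from typing import Dict, List

# Hypothetical cluster parameters: attribute name -> P(value | cluster).
ClusterParams = Dict[str, List[float]]


def flatten(params: ClusterParams, attributes: List[str]) -> List[float]:
    """Concatenate the per-attribute probability vectors in a fixed order."""
    return [p for name in attributes for p in params[name]]


def centroid(vectors: List[List[float]]) -> List[float]:
    """Component-wise mean of equally long parameter vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


attributes = ["color", "size"]
# Estimates of the same cluster obtained from two different subsamples.
estimates = [
    {"color": [0.7, 0.2, 0.1], "size": [0.6, 0.4]},
    {"color": [0.5, 0.3, 0.2], "size": [0.8, 0.2]},
]
center = centroid([flatten(e, attributes) for e in estimates])
```

The first three entries of `center` describe "color" and the last two describe "size"; each group still sums to one, so the centroid can itself be read as a multinomial model.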

While the present invention has been described with a 
degree of particularity, it is the intent that the invention 
include all alterations or modifications from the exemplary 
design falling within the spirit or scope of the appended 
claims. 

We claim: 

1. A method for evaluating data in a database that is stored 
on a storage medium, wherein the database has a data set to 
be evaluated, and wherein the data set is comprised of a 
plurality of records, comprising the steps of: 

a) obtaining a multiple number of data subsets comprising 
a plurality of records from the data set, 

b) performing clustering analysis on the data records that 
make up each of the subsets to provide a multiple 
number of candidate clustering starting points; 

c) choosing one of the multiple candidate clustering 
starting points to be used in clustering the data set to be 
evaluated; and 

d) using the one chosen candidate clustering starting point 
as a starting point to perform clustering analysis on the 
data set to be evaluated. 

2. The method of claim 1 wherein the step of performing 
clustering analysis on each of the subsets is performed from 
an initial starting point which is refined to determine said 
candidate clustering starting point. 

3. The method of claim 1 wherein the step of choosing one 
of the candidate clustering starting points comprises the step 
of performing additional data clustering on the multiple 
number of candidate clustering starting points. 

4. The method of claim 1 wherein the step of choosing one 
of the candidate clustering starting points comprises the step 
of performing additional data clustering on the multiple 
number of candidate clustering starting points and wherein 
one of said multiple number of candidate clustering starting 
points is chosen as a refined clustering starting point to 
perform the data clustering. 

5. The method of claim 2 wherein the initial starting point 
is randomly determined. 

6. The method of claim 2 wherein the step of performing 
clustering analysis on each of the subsets is performed by 
removing sparsely populated clusters, merging the sparsely 
populated clusters with clusters that are more densely populated
and determining a new initial starting point for subsequent
clustering of data.

7. The method of claim 2 wherein if at the end of the 
clustering of each of the subsets, any of the clusters have 
zero membership, then the corresponding initial choice for 
this empty cluster centroid is adjusted to a solution set to a 
data point in the sample farthest from the initial choice that 
resulted in the empty cluster. 

8. The method of claim 2 wherein if at the end of the 
clustering of each of the subsets any of the clusters have zero 
membership a new cluster mean for the empty cluster is 
chosen from the mean of the entire data sample. 

9. The method of claim 2 wherein if at the end of the 
clustering the data records that make up each of the subsets, 
any of the clusters have zero membership a new cluster 
mean for the empty cluster is chosen by picking the mean of 
the entire data set and perturbing that mean by a small 
random amount corresponding to a variance in the data 
mean in each dimension of the data sample. 

10. The method of claim 1 wherein the step of performing 
a clustering analysis on each of the subsets uses a different 
clustering process than the clustering process that is used to 
perform clustering analysis on the data set to be evaluated. 

11. In a computer data mining system, apparatus for 
evaluating data in a database comprising: 

a) one or more data storage devices for storing a database 
of records on a storage medium; 

b) a computer having an interface to the storage devices
for reading data from the storage medium and bringing the
data into a rapid access memory for subsequent evaluation;
and

c) said computer comprising a processing unit for evalu- 
ating at least some of the data in the database and for 
clustering the data into multiple numbers of data clus- 
ters; said processing unit programmed to retrieve mul- 
tiple subsets of data from the database, find multiple 
candidate clustering starting points from the multiple 
data subsets retrieved from the database and choosing 
an optimum solution from the multiple number of 
candidate clustering starting points to begin subsequent 
clustering on data in the database. 

12. In a computer database system, a method for use in 
choosing starting conditions in a data clustering procedure 
comprising the steps of: 

a) choosing a cluster number K for use in categorizing the 
data in the database into K different data clusters; 

b) choosing an initial centroid for each of the K data 
clusters; 

c) sampling a portion of the data in the database from a 
storage medium and performing a clustering on the data 
sampled from the database based on the K centroids to 
form K characterizations of the database; 

d) repeating the sampling and clustering steps until a 
plurality of clustering solutions have been determined; 
and 

e) choosing a best solution from said plurality of cluster- 
ing solutions to use as a starting point in further 
clustering of data from the database. 

13. Apparatus for evaluating data in a database that is 
stored on a storage medium, wherein the database has a data 
set to be evaluated, and wherein the data set is comprised of 
a plurality of records, the apparatus comprising: 

a) means for obtaining a multiple number of data subsets 
comprising a plurality of records from the data set, 

b) means for performing clustering analysis on the data 
records that make up each of the subsets to provide a 
multiple number of candidate clustering starting points; 



c) means for choosing one of the multiple candidate 
clustering starting points to be used in clustering the 
data set to be evaluated; and 

d) means for using the chosen starting point to perform 
clustering analysis on the data set to be evaluated. 

14. The apparatus of claim 13 additionally comprising 
means to randomly choose an initial starting point for use in 
clustering the multiple data subsets. 

15. The apparatus of claim 13 additionally comprising 
means to choose an initial starting point for use in perform- 
ing clustering analysis on the multiple data subsets based on 
data contained in the database. 

16. The apparatus of claim 13 wherein the means for 
choosing comprises means for clustering the multiple num- 
ber of candidate clustering starting points using each starting 
point as an interim starting point for clustering said starting 
points. 

17. The apparatus of claim 16 wherein the means for 
choosing comprises means for determining a best refined 
candidate clustering starting point from an intermediate 
clustering of the data sample solutions based upon a distance 
from the data points of said multiple solutions to a set of 
intermediate clustering solutions. 

18. A computer-readable medium having computer-executable
instructions for performing steps for evaluating a
database wherein the database has a data set to be evaluated, 
and wherein the data set is comprised of a plurality of 
records, comprising the steps of: 

a) obtaining a multiple number of data subsets comprising 
a plurality of records from the data set, 

b) performing clustering analysis on the data records that 
make up each of the subsets to provide a multiple 
number of candidate clustering starting points; 

c) choosing one of the multiple candidate clustering 
starting points to be used in clustering the data set to be 
evaluated; and 

d) using the chosen starting point to perform clustering 
analysis on the data set to be evaluated. 

19. A method for evaluating data in a database that is 
stored on a storage medium, wherein the database has a data 
set to be evaluated, and wherein the data set is comprised of 
a plurality of records, comprising the steps of: 

a) obtaining a multiple number of data subsets comprising 
a plurality of records from the data set, 

b) performing clustering analysis on the data records that 
make up each of the subsets to provide a multiple 
number of candidate clustering starting points; 

c) choosing a clustering starting point based on the 
multiple candidate clustering starting points to be used 
in clustering the data set to be evaluated; and 

d) using the chosen starting point to perform clustering 
analysis on the data set to be evaluated. 

20. The method of claim 19 wherein each of the multiple 
number of candidate clustering starting points is used as an 
intermediate starting point for clustering the candidate start- 
ing points and wherein a resulting solution of said clustering 
of candidate starting points is used as a refined clustering 
starting point. 

21. The method of claim 20 wherein the refined clustering 
starting point is chosen by determining a distance from 
multiple intermediate clustering solutions and a set of data 
points made up of the candidate clustering starting points. 

22. The method of claim 21 wherein the refined clustering 
starting point is an optimum solution to the clustering of the 
multiple candidate clustering starting points wherein opti- 
mum is based on said distance. 

23. The method of claim 1 wherein the step of choosing 
one of the multiple candidate clustering starting points is 
performed by: 

a) choosing a candidate clustering starting point; 

b) using the chosen candidate clustering starting point to 
perform an interim clustering analysis on an interim 
data set; 

c) determining the quality of the cluster solution resulting 
from the interim clustering analysis; 

d) repeating steps a) through c) until all candidate clus- 
tering starting points have been used in interim clus- 
tering analysis; and 

e) selecting the candidate starting point which yielded the 
best quality cluster solution as a refined starting point 
for performing clustering analysis on the data set to be 
evaluated. 

24. The method of claim 23 wherein the interim data set 
is a plurality of unchosen candidate starting points. 

25. The method of claim 23 wherein the interim data set 
is a plurality of data records that make up the data subsets. 

26. The method of claim 23 wherein the step of determining
the quality of the interim clustering analysis is
performed by calculating a degree of fit of the chosen
candidate starting point with the interim data set.